by Andreas Johannes
1 Background
2 Data sources and treatment
3 Methodology
4 Results
5 Discussion
6 Conclusion
Parisian rose seller, this could be you! Whether you are selling roses to couples or playing your fiddle for tips, you want to know where the most restaurants and bars are, because that's where the most money can be made. Read on for an in-depth analysis of where to go tonight to ply your trade.
PLUS, if you know you made money in one area, use our similarity rating to find similar areas for your next night's work!
To find the best areas to sell roses on the street:
Use the above categories to find areas that are similar:
In this section we will execute the strategy outlined in the previous section.
import numpy as np
import pandas as pd
import folium
import pygeoj
import matplotlib.pyplot as plt
import seaborn as sns
Create a regular hexagonal grid around Paris. We will use cube coordinates centered on the center of Paris according to wiki: Paris. The tiles will be spaced 200 m apart and we will have 30 tiles in each direction. This covers the center of Paris quite well and should have sufficient resolution. See [https://www.redblobgames.com/grids/hexagons/] for an introduction to hexagonal coordinates.
# get a 3D grid with 2*tile_count + 1 tiles across
tile_count = tc = 30
p_range, q_range, r_range = range(-tc,tc+1),range(-tc,tc+1),range(-tc,tc+1)
p_i, q_i, r_i = np.meshgrid(p_range, q_range, r_range)
pqr_i = np.stack([p_i.flat, q_i.flat, r_i.flat])
# reduce grid to include only the indexes on our hexagonal plane
hex_mask = pqr_i.sum(axis=0)==0
hex_mask
#xyz_hex = np.empty(shape=(3,hex_mask.sum()),dtype=np.int32)
pqr_hex = pqr_i[:,hex_mask]
pqr_hex.dtype, pqr_hex.T.shape
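As a quick sanity check: a hexagon of "radius" tc in cube coordinates contains 3*tc**2 + 3*tc + 1 tiles, so for tc = 30 we expect 2791 tiles. A minimal standalone sketch of the grid construction above:

```python
import numpy as np

tc = 30
rng = range(-tc, tc + 1)
p_i, q_i, r_i = np.meshgrid(rng, rng, rng)
pqr_i = np.stack([p_i.flat, q_i.flat, r_i.flat])
# keep only the cube coordinates on the hexagonal plane p + q + r == 0
pqr_hex = pqr_i[:, pqr_i.sum(axis=0) == 0]
# a hexagon of radius tc holds 3*tc**2 + 3*tc + 1 tiles: 2791 for tc = 30
assert pqr_hex.shape[1] == 3 * tc**2 + 3 * tc + 1 == 2791
```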
We have an index grid; now to convert it into geospatial coordinates. We want the spacing to be tile_size and first need to convert that distance to angular distances. We only cover a small segment of the spherical Earth and can use the appropriate small-angle simplifications; see wiki: geographic coordinate system.
tile_size = ts = 200. # m
earth_radius = 6367449
center_of_paris = (48.8567, 2.3508)
# degrees per meter (small-angle approximation)
lat_conversion = 180./(np.pi*earth_radius)
# a meter east-west spans more degrees of longitude at higher latitudes
lon_conversion = 180./(np.pi*earth_radius*np.cos(np.pi/180.0*center_of_paris[0]))
# vectors from the center of a hex to its corner points, in degrees (lon, lat)
ns = 0.5*ts*lat_conversion
ew = 0.5*ts*lon_conversion
s60 = np.sin(60./180.*np.pi)
c60 = np.cos(60./180.*np.pi)
x_step = (ew, 0)
y_step = (-ew*c60, ns*s60)
z_step = (-ew*c60, -ns*s60)
step_vector = np.asarray((x_step, y_step, z_step)).T
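One property worth checking: the three hex axis step vectors must sum to zero, which is exactly what makes the constraint p + q + r = 0 consistent. A minimal sketch with unit-length vectors in an abstract plane (not the lon/lat vectors above):

```python
import numpy as np

s60, c60 = np.sin(np.pi / 3), np.cos(np.pi / 3)
x_step = np.array((1.0, 0.0))
y_step = np.array((-c60, s60))
z_step = np.array((-c60, -s60))
# one step along each of the three axes returns to the starting point
assert np.allclose(x_step + y_step + z_step, 0.0)
```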
def get_corners(step_vector, center):
    '''
    returns the list of coordinates for the corners of a hexagon defined by
    the hexagonal step vector and a center point
    '''
    coordinates = []
    perms = [[1, 0, 0],
             [0, 0, -1],
             [0, 1, 0],
             [-1, 0, 0],
             [0, 0, 1],
             [0, -1, 0]]
    for perm in perms:
        coordinates.append(list(center + np.dot(step_vector, perm)))
    return coordinates
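A small check of get_corners on a unit hexagon (toy step vectors, not the Paris ones): all six corners should sit at the same distance from the center.

```python
import numpy as np

def get_corners(step_vector, center):
    # corners are reached by +/- one step along each of the three hex axes
    perms = [[1, 0, 0], [0, 0, -1], [0, 1, 0],
             [-1, 0, 0], [0, 0, 1], [0, -1, 0]]
    return [list(center + np.dot(step_vector, p)) for p in perms]

s60, c60 = np.sin(np.pi / 3), np.cos(np.pi / 3)
step_vector = np.asarray(((1.0, 0.0), (-c60, s60), (-c60, -s60))).T
corners = get_corners(step_vector, np.zeros(2))
assert len(corners) == 6
# every corner sits at unit distance from the center
assert np.allclose(np.linalg.norm(corners, axis=1), 1.0)
```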
We have all we need to create the hexagonal grid mapped over Paris.
# useful library to create geojson files
# https://github.com/karimbahgat/PyGeoj
# creating regular tiles around city center
json_tiles = pygeoj.new()
json_tiles_fname = "tiles.geojson"
coords_str_list = []
center_list = []
p_list = []
q_list = []
r_list = []
for coords in pqr_hex.T:
    coords_str = ('_').join([str(x) for x in coords])
    coords_str_list.append(coords_str)
    p_list.append(coords[0])
    q_list.append(coords[1])
    r_list.append(coords[2])
    # center in (lon, lat) order, as GeoJSON expects
    center = center_of_paris[::-1] + np.dot(step_vector, coords)
    center_list.append(center)
    coordinates = get_corners(step_vector, center)
    # add the tile as a polygon feature to the geojson file
    json_tiles.add_feature(
        properties={"coords_str": coords_str},
        geometry={"type": "Polygon", "coordinates": [coordinates]})
json_tiles.add_all_bboxes()
json_tiles.update_bbox()
json_tiles.add_unique_id()
json_tiles.save(json_tiles_fname)
center_list[0], center_of_paris
# create a corresponding dataframe:
center_array = np.asarray(center_list)
df_tiles = pd.DataFrame({'coords_str':coords_str_list,
'lat':center_array[:,1],
'lon':center_array[:,0]})
latdist_array = (np.asarray(df_tiles.lat)-center_of_paris[0])/lat_conversion
londist_array = (np.asarray(df_tiles.lon)-center_of_paris[1])/lon_conversion
df_tiles['distance_to_center'] = np.asarray(np.sqrt(latdist_array**2 + londist_array**2),
dtype=np.int32)
df_tiles['p'] = p_list
df_tiles['q'] = q_list
df_tiles['r'] = r_list
no_tiles = df_tiles.shape[0]
map_paris = folium.Map(location=center_of_paris, zoom_start=12)
# Add the color for the choropleth:
folium.Choropleth(
geo_data=json_tiles_fname,
name='choropleth',
data=df_tiles,
fill_color='Blues',
columns=['coords_str', 'distance_to_center'],
key_on='feature.properties.coords_str',
fill_opacity=0.5,
line_opacity=0.1,
legend_name='Distance to Center',
).add_to(map_paris)
map_paris
df_tiles
Now that we have the grid on which we want to check for locations, let's use Foursquare to find them. We will immediately collect the different restaurant types separately for later use; see foursquare: categories.
import requests
# not sharing foursquare credentials
with open('../../foursquare_credentials.dat', 'r') as f:
    client_id, client_secret = f.readlines()
client_id = client_id[:-1]  # strip the trailing newline
version = '20180724'
Manual selection of some categories:
food_category = '4d4b7105d754a06374d81259' # 'Root' category for all food-related venues
nightlife_category = '4d4b7105d754a06376d81259'# 'Root' category for all nightlife venues
# other categories:
categories_dict = {'other':['503288ae91d4c4b30a586d67',
'4bf58dd8d48988d1c8941735',
'4bf58dd8d48988d14e941735',
'4bf58dd8d48988d169941735',
'52e81612bcbc57f1066b7a01',
'4bf58dd8d48988d1df931735',
'52e81612bcbc57f1066b79f4',
'4bf58dd8d48988d17a941735',
'4bf58dd8d48988d144941735',
'4bf58dd8d48988d108941735',
'4bf58dd8d48988d120951735',
'4bf58dd8d48988d1be941735',
'4bf58dd8d48988d1c1941735',
'56aa371be4b08b9a8d573508',
'4bf58dd8d48988d1c4941735',
'4bf58dd8d48988d1ce941735',
'4bf58dd8d48988d1cc941735',
'4bf58dd8d48988d1dc931735',
'56aa371be4b08b9a8d573538'],
'sweet':['4bf58dd8d48988d146941735',
'52e81612bcbc57f1066b79f2',
'4bf58dd8d48988d1d0941735',
'4bf58dd8d48988d148941735'],
'european':['52f2ae52bcbc57f1066b8b81',
'5293a7d53cf9994f4e043a45',
'4bf58dd8d48988d147941735',
'5744ccdfe4b0c0459246b4d0',
'4bf58dd8d48988d109941735',
'52e81612bcbc57f1066b7a05',
'52e81612bcbc57f1066b7a09',
'4bf58dd8d48988d10c941735',
'52e81612bcbc57f1066b79fa',
'4bf58dd8d48988d110941735',
'52e81612bcbc57f1066b79fd',
'4bf58dd8d48988d1c0941735',
'52e81612bcbc57f1066b79f9',
'4bf58dd8d48988d1c2941735',
'52e81612bcbc57f1066b7a04',
'4def73e84765ae376e57713a',
'5293a7563cf9994f4e043a44',
'4bf58dd8d48988d1c6941735',
'5744ccdde4b0c0459246b4a3',
'56aa371be4b08b9a8d57355a',
'4bf58dd8d48988d150941735',
'4bf58dd8d48988d158941735',
'4f04af1f2fb6e1c99f3db0bb',
'52e928d0bcbc57f1066b7e96'],
'asian':['4bf58dd8d48988d142941735',
'4bf58dd8d48988d10f941735',
'4bf58dd8d48988d115941735',
'52e81612bcbc57f1066b79f8',
'5413605de4b0ae91d18581a9'],
'fast':['4bf58dd8d48988d179941735',
'4bf58dd8d48988d16a941735',
'52e81612bcbc57f1066b7a02',
'52e81612bcbc57f1066b79f1',
'4bf58dd8d48988d143941735',
'52e81612bcbc57f1066b7a0c',
'4bf58dd8d48988d16c941735',
'4bf58dd8d48988d128941735',
'4bf58dd8d48988d16d941735',
'4bf58dd8d48988d1e0931735',
'52e81612bcbc57f1066b7a00',
'4bf58dd8d48988d10b941735',
'4bf58dd8d48988d16e941735',
'4edd64a0c7ddd24ca188df1a',
'56aa371be4b08b9a8d57350b',
'4bf58dd8d48988d1cb941735',
'4d4ae6fc7a7b7dea34424761',
'5283c7b4e4b094cb91ec88d7',
'4bf58dd8d48988d1ca941735',
'4bf58dd8d48988d1c5941735',
'4bf58dd8d48988d1bd941735',
'4bf58dd8d48988d1c7941735',
'4bf58dd8d48988d1dd931735'],
'night_life':['52e81612bcbc57f1066b7a06',
nightlife_category]}
# fix feature order:
feature_list = list(categories_dict.keys())
feature_list.sort()
feature_list
Unfortunately, some of these are parent categories whose venues will not be correctly counted if we don't poll all their children. Therefore, we need to delve a little deeper into the Foursquare category system. As an example, the nightlife category has some subcategories.
get_categories_url = 'https://api.foursquare.com/v2/venues/categories?&client_id={}&client_secret={}&v={}'.format(
client_id, client_secret, version)
all_foursquare_categories = requests.get(get_categories_url).json()['response']['categories']
def get_category_by_id(parent, category_id, result=None):
    '''recursively search the category tree for the entry with the given id'''
    if result is None:
        if type(parent) == list:
            for parent_category in parent:
                result = get_category_by_id(parent_category, category_id, result)
        elif type(parent) == dict:
            if parent['id'] == category_id:
                return parent
            elif len(parent['categories']) != 0:
                for item in parent['categories']:
                    result = get_category_by_id(item, category_id, result)
            else:
                result = None
        return result
    else:
        return result
nightlife_categories = get_category_by_id(all_foursquare_categories, nightlife_category)
def get_descendant_categories(parent, categories=None, verbose=False):
    '''collect the ids of a category and all of its descendants'''
    if categories is None:  # avoid a shared mutable default argument
        categories = []
    if type(parent) == list:
        for parent_category in parent:
            categories = get_descendant_categories(parent_category, categories, verbose)
        return categories
    elif type(parent) == dict:
        if verbose:
            print(parent['name'], len(categories) + 1)
        categories.append(parent['id'])
        if len(parent['categories']) == 0:
            return categories
        else:
            for item in parent['categories']:
                categories = get_descendant_categories(item, categories, verbose)
            return categories
nl = get_descendant_categories(nightlife_categories, [], verbose=True)
len(nl)
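To see how the two helpers interact, here is a minimal sketch on a hypothetical three-level category tree, using simplified variants of the functions above (the ids are made up, not real Foursquare ids):

```python
def get_category_by_id(parent, category_id, result=None):
    # recursively search a category tree for the entry with the given id
    if result is not None:
        return result
    if isinstance(parent, list):
        for item in parent:
            result = get_category_by_id(item, category_id, result)
        return result
    if parent['id'] == category_id:
        return parent
    for item in parent['categories']:
        result = get_category_by_id(item, category_id, result)
    return result

def get_descendant_categories(parent, categories=None):
    # collect the id of a category and all of its descendants
    if categories is None:
        categories = []
    if isinstance(parent, list):
        for item in parent:
            get_descendant_categories(item, categories)
        return categories
    categories.append(parent['id'])
    for item in parent['categories']:
        get_descendant_categories(item, categories)
    return categories

# hypothetical tree: nightlife -> bar -> wine_bar, plus nightclub
tree = [{'id': 'nightlife', 'categories': [
            {'id': 'bar', 'categories': [{'id': 'wine_bar', 'categories': []}]},
            {'id': 'nightclub', 'categories': []}]}]
assert get_category_by_id(tree, 'bar')['id'] == 'bar'
assert get_descendant_categories(get_category_by_id(tree, 'nightlife')) == \
    ['nightlife', 'bar', 'wine_bar', 'nightclub']
```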
Now we can iterate over the manually created categories dict above to get a comprehensive set of all related category ids.
catsets_dict = {}
for key, categories in categories_dict.items():
    key_list = []
    for cat_id in categories:
        parent_cat = get_category_by_id(all_foursquare_categories, cat_id)
        # the helper appends in place and returns the same list, so reassign
        # rather than += (which would duplicate every entry)
        key_list = get_descendant_categories(parent=parent_cat, categories=key_list, verbose=False)
    catsets_dict.update({key: set(key_list)})
for key, val in catsets_dict.items():
    print("category {} has {} id's".format(key, len(val)))
We define the functions that will GET the Foursquare data for each tile around Paris and filter the categories. We choose a radius of 250 m, which causes a little overlap between tiles. Double-counting a few venues should not critically affect this analysis.
def get_categories(categories):
    '''extract the list of category ids from a venue's category entries'''
    return [cat['id'] for cat in categories]

def count_categories(catsets_dict, found_categories):
    '''count how many found venue ids fall into each of our category sets'''
    result = dict([[x, 0] for x in catsets_dict.keys()])
    for key, categories_list in catsets_dict.items():
        for found_id_list in found_categories:
            for found_id in found_id_list:
                if found_id in categories_list:
                    result[key] += 1
    return result

def get_venues_near_location(lat, lon, client_id, client_secret, radius=250, limit=100):
    '''GET the venues around (lat, lon) and return their category ids'''
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, radius, limit)
    response_json = requests.get(url).json()['response']
    try:
        item_list = response_json['groups'][0]['items']
        venue_categories = [get_categories(item['venue']['categories']) for item in item_list]
    except KeyError:
        # no venues returned for this tile
        venue_categories = []
    return venue_categories, response_json
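The counting logic can be checked locally without an API call. A small sketch with a stubbed-out response (the ids here are hypothetical, not real Foursquare ids):

```python
def count_categories(catsets_dict, found_categories):
    # count how many of the found ids fall into each category set
    result = dict([[x, 0] for x in catsets_dict.keys()])
    for key, categories_list in catsets_dict.items():
        for found_id_list in found_categories:
            for found_id in found_id_list:
                if found_id in categories_list:
                    result[key] += 1
    return result

# each inner list holds the category ids of one venue
found = [['bar'], ['bistro', 'cafe'], ['noodles']]
catsets = {'night_life': {'bar', 'club'}, 'european': {'bistro', 'cafe'}}
assert count_categories(catsets, found) == {'night_life': 1, 'european': 2}
```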
Unfortunately, the free Foursquare account only grants 950 regular calls to its server per day. Since we have 2791 tiles in our grid, we will need to retrieve the data and save it to disk over several days.
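The batching arithmetic: with batches of 700 calls (comfortably below the 950-call quota, as in the download loop below), covering all 2791 tiles takes four sessions.

```python
import math

no_tiles = 2791          # tiles in the hexagonal grid
calls_per_day = 700      # stay below the 950-call daily quota
days_needed = math.ceil(no_tiles / calls_per_day)
assert days_needed == 4
```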
try:
    # read the file if it already exists
    df_tiles = pd.read_csv('tiles.csv')
    unnamed_cols = [c for c in df_tiles.columns if str(c).find('named:') > 0]
    df_tiles.drop(unnamed_cols, axis=1, inplace=True)
    print('data read')
except FileNotFoundError:
    # otherwise initialize empty count columns; the file is created later
    for key in feature_list:
        df_tiles[key] = np.zeros(no_tiles)
    df_tiles['dl_done'] = np.zeros(no_tiles)
    print('no data read')
Now we need to iterate the data-getter function over all the remaining tiles plotted on the map; those are the ones where the dl_done column entry is 0.
done_mask = df_tiles['dl_done'].to_numpy()
no_remaining_tiles = no_tiles - done_mask.sum()
# pair each tile index with its (lon, lat) center
index_coord = np.asarray(list(zip(range(no_tiles), center_array)), dtype=object)
# stay well below the daily quota of 950 calls
to_do = int(min(700, no_remaining_tiles))
remaining_coords = index_coord[np.where(done_mask == 0)]
if len(remaining_coords) == 0:
    print('all data already downloaded')
counts_array = np.zeros((no_tiles, len(feature_list) + 1))
for i, key in enumerate(feature_list + ['dl_done']):
    counts_array[:, i] = df_tiles[key]
for i, coord in remaining_coords[:to_do]:
    print('{}'.format(i))
    # centers are stored in (lon, lat) order
    foursquare_result = get_venues_near_location(lat=coord[1],
                                                 lon=coord[0],
                                                 client_id=client_id,
                                                 client_secret=client_secret,
                                                 radius=250, limit=100)
    found_categories = foursquare_result[0]
    counts = count_categories(catsets_dict=catsets_dict, found_categories=found_categories)
    for j, key in enumerate(feature_list):
        counts_array[i, j] = counts[key]
    counts_array[i, -1] = 1  # mark this tile as downloaded
Let's update our pandas DataFrame and save it to disk, so we don't need to call Foursquare unnecessarily:
for j, key in enumerate(feature_list + ['dl_done']):
    df_tiles[key] = counts_array[:, j]
# total venue count per tile, excluding the dl_done flag column
df_tiles['all'] = counts_array[:, :-1].sum(axis=1)
unnamed_cols = [c for c in df_tiles.columns if str(c).find('named:') > 0]
df_tiles.drop(unnamed_cols, axis=1, inplace=True)
df_tiles.to_csv('tiles.csv', index=True, header=True)
df_tiles=pd.read_csv('tiles.csv')
unnamed_cols = [c for c in df_tiles.columns if str(c).find('named:')>0]
df_tiles.drop(unnamed_cols,axis=1,inplace=True)
df_tiles.head()
Let's define an order of features and an associated color code:
feature_list = ['european',
'fast',
'night_life',
'asian',
'other',
'sweet',
'all']
color_list = ['Blues',
'Reds',
'Greens',
'RdPu',
'Purples',
'Oranges',
'Greys']
Let's see how many restaurants we tend to find:
for feature in feature_list:
    print('max of {} is {}'.format(feature, df_tiles[feature].max()))
This looks very reasonable. We get sufficiently large counts, but not too large, indicating that the spatial and topical categories are well proportioned.
df_features = df_tiles[feature_list]
df_features.head()
f, ax = plt.subplots(figsize=(12, 8))
sns.boxplot(data=df_features, palette=['blue','red','green','magenta','navy','orange','grey'])
We see that some areas have a very high number and diversity of restaurants; this follows from the fact that the sum category 'all' is much greater than each individual category. However, all categories have some areas with large counts, and especially 'european', 'fast', 'asian' and 'night_life' have non-negligible median values. All of these factors point to a promising dataset for further evaluation.
The next step will be to plot the heat maps for all restaurant and bar categories around Paris!
map_paris = folium.Map(location=center_of_paris, zoom_start=13)
# Add one choropleth layer per feature:
for name, color in zip(feature_list, color_list):
    folium.Choropleth(
        geo_data=json_tiles_fname,
        name=name,
        data=df_tiles,
        columns=['coords_str', name],
        key_on='feature.properties.coords_str',
        fill_color=color,
        fill_opacity=0.4,
        line_opacity=0.0,
        legend_name='Number of {} venues'.format(name),
    ).add_to(map_paris)
folium.LayerControl().add_to(map_paris)
map_paris